A Linear Approximation to the χ² Kernel with Geometric Convergence

Abstract

We propose a new analytical approximation to the χ² kernel that converges geometrically. The analytical
approximation is derived with elementary methods and adapts to the input distribution for an optimal convergence
rate. Experiments show that the new approximation leads to improved performance in image classification and
semantic segmentation tasks using a random Fourier feature approximation of the exp-χ² kernel. In addition,
out-of-core principal component analysis (PCA) methods are introduced to reduce the dimensionality of the
approximation and achieve better performance at the expense of only an additional constant factor in the time
complexity. Moreover, when PCA is performed jointly on the training and unlabeled testing data, further
performance improvements can be obtained. Experiments conducted on the PASCAL VOC 2010 segmentation and the
ImageNet ILSVRC 2010 datasets show statistically significant improvements over alternative approximation methods.



1 Introduction 

Histograms are important tools for constructing visual object descriptors.
Many visual recognition approaches utilize similarity comparisons between
histogram descriptors extracted from training and testing images. Widely
used approaches such as k-nearest neighbors and support vector machines
compare the testing descriptor with multiple training descriptors, and make
predictions by a weighted sum of these comparison scores.

An important metric to compare histograms is the exponential-χ² kernel
(referred to as exp-χ² in the rest of the paper), derived from the classic
Pearson χ² test and utilized in many state-of-the-art object recognition
studies [1], [2], [3], [4] with excellent performance. However, in the
current big data era, training sets often contain millions to billions of
examples. Training and testing via hundreds of thousands of comparisons
using a nonlinear metric is often very time-consuming.

There are two main approaches to approximate the exp-χ² kernel to
facilitate fast linear time training and testing. One approach is to devise
a transformation so that the χ² function can be represented as an inner
product between two vectors. On top of this transformation, the random
Fourier (RF) features methodology [5] is used to approximate a Gaussian
kernel. The full exp-χ² kernel can be approximated by inner products on the
vector after these two transformations [6]. A different approach is the
Nyström method [7], which directly takes a subset of training examples,
applies the comparison metric between an example and this subset, and uses
the output as the feature vector (sometimes followed by principal component
analysis (PCA)).

In this paper, we pursue the RF research line. We are interested in RF
because it has the potential of representing more complicated functions
than the Nyström approach, which is confined to summations of kernel
comparisons and can hardly approximate functions not of that type. However,
RF has not been able to outperform Nyström so far, especially on image data
with the exp-χ² approximation.

We believe that one reason for the suboptimal previous performance of RF
with the exp-χ² kernel is the inaccuracy in the approximation of the χ²
metric. A significant contribution of this paper is a new analytic series
to approximate the χ² kernel. The new series is derived using only
elementary techniques and enjoys a geometric convergence rate. Therefore,
it is orders of magnitude better in terms of approximation error than
previously proposed approaches [8], [9]. Experiments show that this better
approximation quality directly translates to better classification accuracy
when it is used in conjunction with the RF method to approximate the exp-χ²
kernel.

Another research question we pursue is whether we can also improve the
empirical performance of RF by applying PCA on the generated features. By
applying PCA, the theoretical convergence rate of RF is no longer confined
by the Monte Carlo rate O(1/√d), where d is the number of dimensions used
in the approximation. Rather, it becomes dependent on the eigenvalues, and
with a fast enough eigenvalue decay rate, the convergence rate can reach
O(1/d) or better [10], raising it to at least the same level as the Nyström
approach. The question is then whether applying PCA on RF would translate
to a comparable (or better) empirical performance.

For this question, we exploit out-of-core versions 
of PCA that add little computational overhead to 
the RF approximation, especially when combined 
with least squares and other quadratic losses, e.g. 
group LASSO. PCA allows us to reduce the number 
of dimensions required for classification and relaxes 
memory constraints when multiple kernels have to 
be approximated by RF. We also explore the use of 
unlabeled (test) data in order to better estimate the co- 
variance matrix in PCA. This turns out to improve the 
performance by better selecting effective frequency 
components. 

The paper is organized as follows: Section 2 summarizes related work.
Section 3 describes the χ² kernel, where we elaborate the connection
between the exp-χ² kernel and the χ² test. In Section 4, we present the new
analytical approximation with geometric convergence rate. Section 5
describes the out-of-core PCA, Section 6 presents experimental results on
PASCAL VOC 2010 and ImageNet ILSVRC 2010 data, and Section 7 concludes the
paper.

2 Related Work 

To our knowledge, the use of the χ² kernel for histogram comparison can be
traced back to at least 1996 [11]. [12] constructed the exp-χ² kernel and
used it in SVM-based image classification. They hypothesized that the
exponential χ² is a Mercer kernel, but the real proof was not available
until 2004 in the appendix of [13]. The χ² and exp-χ² kernels have been
used in a number of visual classification and object detection systems [1],
[2], [3], [4] and have been shown to have one of the best performances
among histogram kernels [14]. [15] proposes an extension to the χ² kernel
that normalizes the χ² across different bins. Other metrics for histogram
comparison include histogram intersection, for which an efficient speed-up
for testing has also been proposed [16], the Hellinger kernel, the earth
mover's distance [17] and Jensen-Shannon. See [8] for a summary and
comparisons.

Random Fourier features were proposed by [5] for translation-invariant
kernels. [6] generalizes them to the exp-χ² kernel by the aforementioned
two-step approach. Several other studies on linear kernel approximations
also used ideas from RF [8], [9], [18].
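The random Fourier construction for a Gaussian kernel can be sketched in a few lines. The following is a minimal illustration, not the implementation used in this paper: it assumes the kernel form exp(-γ‖x - y‖²), and the function name `make_rf_map`, the parameter `gamma` and the feature count are hypothetical choices made only for the example.

```python
import math
import random

def make_rf_map(dim, n_features, gamma, seed=0):
    """Return z(.) such that dot(z(x), z(y)) ~= exp(-gamma * ||x - y||^2)."""
    rng = random.Random(seed)
    # Frequencies follow the Fourier transform of the Gaussian kernel:
    # a zero-mean normal distribution with variance 2 * gamma per coordinate.
    sigma = math.sqrt(2.0 * gamma)
    freqs = [[rng.gauss(0.0, sigma) for _ in range(dim)]
             for _ in range(n_features)]
    phases = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(n_features)]
    scale = math.sqrt(2.0 / n_features)

    def z(x):
        return [scale * math.cos(sum(w_j * x_j for w_j, x_j in zip(w, x)) + b)
                for w, b in zip(freqs, phases)]
    return z

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

x = [0.2, 0.5, 0.3]
y = [0.4, 0.4, 0.2]
gamma = 1.5
z = make_rf_map(dim=3, n_features=20000, gamma=gamma)
exact = math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))
approx = dot(z(x), z(y))
print(exact, approx)  # the two values agree up to Monte Carlo error
```

The approximation error shrinks at the Monte Carlo rate as the number of random features grows, which is why the number of dimensions matters so much in the experiments.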



The Nyström method [7] sub-samples the training set and operates on a
reduced kernel matrix. Its asymptotic convergence rate had long been
thought to be slow [19], but recent papers have proved that it is actually
faster than the Monte Carlo rate of RF [20]. Other speed-ups to kernel
methods based on low-rank approximations of the kernel matrix have been
proposed in [21], [22].

A topic of recent interest is methods for coding image features, where the
goal is to achieve good performance using linear learners following a
feature embedding [23], [24]. Hierarchical coding schemes based on deep
structures have also been proposed [25]. Both sparse and dense coding
schemes have proven successful, with supervector coding [26] and the Fisher
kernels [27] among the best performers in the ImageNet large-scale image
classification challenge [28]. The dictionaries of some influential coding
schemes are usually extremely large (both the Fisher kernel and supervector
coding usually require more than 200k dimensions [29]) and the training of
the dictionary is often time-consuming. RF and Nyström do not require
training, hence they are interesting alternatives to these methods.

A crucial component in many coding algorithms is 
a max-pooling approach, which uses the maximum of 
the coded descriptors in a spatial range as features. 
Since in this case an informative small patch could 
have the same descriptor as the whole image, it is 
desirable in image classification (for highlighting im- 
portant regions) but undesirable for object detection 
and semantic segmentation problems, where the size 
and shape of the object is of interest. A recent second-order pooling
scheme [30] proposes an alternative and has shown successful results in
the semantic segmentation problem.

3 The χ² Kernel and its Relationship with the χ² Test

$\circ$ denotes the element-wise product of vectors, and $\frac{a}{b}$
denotes the element-wise division of a by b.



The χ² kernel is derived from Pearson's χ² test. The original Pearson χ²
test is for testing whether an empirical histogram estimate matches a
probability distribution. Given a histogram estimate
$x = [x_1, x_2, \ldots, x_d]$, the test statistic is

$$\chi^2(x, E) = \sum_{i=1}^{d} \frac{(x_i - E_i)^2}{E_i} \tag{1}$$

where $E = [E_1, E_2, \ldots, E_d]$ is the theoretical frequency in the
bins.

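Suppose we have two histogram estimates x and y. One can arrive at a
symmetric version by taking the harmonic mean
$H(a, b) = \frac{2}{1/a + 1/b}$ of the two one-sided statistics in each bin
and summing over the bins:

$$\chi^2(x, y) = \frac{1}{4}\sum_{i=1}^{d} H\!\left(\frac{(x_i - y_i)^2}{x_i}, \frac{(y_i - x_i)^2}{y_i}\right) = \frac{1}{2}\sum_{i=1}^{d} \frac{(x_i - y_i)^2}{x_i + y_i} \tag{2}$$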

The virtue of such a harmonic mean approach lies in removing the singular
points in the kernel: the value of the original χ² test goes to infinity
when $E_i = 0$ and $x_i \neq 0$. Using the harmonic mean approach in (2),
the function is well-defined in all cases.

In order to use the χ² test (1) to determine the goodness of fit, one needs
to compute the p-value of the χ² statistic:

$$p = 1 - P\!\left(\frac{k}{2}, \frac{\chi^2}{2}\right) \tag{3}$$

where k is the number of degrees of freedom of the distribution and
P(k, x) is the regularized Gamma function. The p-value is 1 minus the
cumulative distribution function (CDF) of the test statistic. If a p-value
is small, the observed statistic is very unlikely to occur under the
hypothesized distribution. A usual criterion is to decide that x disagrees
with the distribution specified by $[E_1, \ldots, E_d]$ if p < 0.05. In the
case of the χ² test, with the special case of k = 2, one has
$p = \exp(-\chi^2/2)$, which coincides with the exp-χ² kernel. Since the
p-value is the relevant metric for comparing two distributions, the exp-χ²
kernel can be considered intuitively better than the χ² function as a
similarity metric comparing two histogram distributions. Empirically, we
have tested kernels with different degrees of freedom, and found that
exp-χ² works similarly to erf(χ²) (corresponding to χ² with 1 degree of
freedom) while outperforming all others with more than 2 degrees of
freedom.
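The link between the p-value and the exponential kernel can be checked numerically. The sketch below computes the symmetric statistic ½ Σ (x_i - y_i)² / (x_i + y_i) for two small histograms and compares 1 - P(1, χ²/2) with exp(-χ²/2), using the closed form P(1, t) = 1 - exp(-t) of the regularized Gamma function. The function names and the example histograms are illustrative assumptions, as is the convention that bins empty in both histograms contribute zero.

```python
import math

def chi2_distance(x, y):
    """Symmetric chi-square statistic; bins empty in both inputs add 0."""
    return 0.5 * sum((a - b) ** 2 / (a + b) for a, b in zip(x, y) if a + b > 0)

def p_value_two_dof(chi2):
    """1 - P(1, chi2 / 2), using the closed form P(1, t) = 1 - exp(-t)."""
    regularized_gamma = 1.0 - math.exp(-chi2 / 2.0)
    return 1.0 - regularized_gamma

x = [0.5, 0.3, 0.2, 0.0]
y = [0.4, 0.4, 0.1, 0.1]
d = chi2_distance(x, y)
# The last two printed numbers coincide: the k = 2 p-value is exp(-chi2 / 2).
print(d, p_value_two_dof(d), math.exp(-d / 2.0))
```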

4 An Analytical Approximation to the χ² Kernel with Geometric Convergence

In the following we show an analytical approximation to the χ² kernel. We
start with the one-dimensional case. The χ² kernel in one dimension has the
form:

$$\frac{(x_i - y_i)^2}{2(x_i + y_i)} = \frac{1}{2}(x_i + y_i) - \frac{2 x_i y_i}{x_i + y_i} \tag{4}$$

Because $\sum_i x_i = 1$ and $\sum_i y_i = 1$ in a histogram, the first
term $\frac{1}{2}(x_i + y_i)$ sums to a constant. It is thus important to
represent the term $\frac{2 x_i y_i}{x_i + y_i}$ as an inner product. We
will make repeated use of the following crucial formula

$$\begin{aligned}
\frac{2xy}{x+y} &= \frac{x-k}{x+k}\,\frac{y-k}{y+k}\,\frac{2xy}{x+y} + \left(1 - \frac{x-k}{x+k}\,\frac{y-k}{y+k}\right)\frac{2xy}{x+y} \\
&= \frac{x-k}{x+k}\,\frac{y-k}{y+k}\,\frac{2xy}{x+y} + \frac{2\sqrt{k}\,x}{x+k}\,\frac{2\sqrt{k}\,y}{y+k}
\end{aligned} \tag{5}$$

which gives a one-term linear approximation of the χ² kernel with
$c_x = \frac{2\sqrt{k}\,x}{x+k}$. Repeatedly plugging (5) into the
$\frac{2xy}{x+y}$ in the first term of the right-hand side of (5) gives us
a series with multiple parameters:

$$c_x = \left[\frac{2\sqrt{k_1}\,x}{x+k_1},\ \ \frac{x-k_1}{x+k_1}\,\frac{2\sqrt{k_2}\,x}{x+k_2},\ \ \frac{x-k_1}{x+k_1}\,\frac{x-k_2}{x+k_2}\,\frac{2\sqrt{k_3}\,x}{x+k_3},\ \ \ldots\right] \tag{6}$$

This series has a geometric convergence rate, as the N-term error is
exactly:

$$\frac{(x-k_1)\cdots(x-k_N)\,(y-k_1)\cdots(y-k_N)}{(x+k_1)\cdots(x+k_N)\,(y+k_1)\cdots(y+k_N)}\;\frac{2xy}{x+y} \tag{7}$$

which is straightforwardly geometric if we take
$k = k_1 = \ldots = k_N$, because $\left|\frac{x-k}{x+k}\right| < 1$ for
all $x > 0$ and $0 < k < 1$.
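The series and its closed-form remainder can be verified numerically. The sketch below builds the feature vector whose n-th entry is the running product of (x - k_j)/(x + k_j) for j < n times 2√k_n x/(x + k_n), and checks that 2xy/(x + y) minus the inner product of the two feature vectors equals the product-form error. The function names and the parameter values are arbitrary choices for illustration only.

```python
import math

def direct_features(x, ks):
    """Feature vector for a scalar x >= 0 and positive parameters ks."""
    feats, prefix = [], 1.0
    for k in ks:
        feats.append(prefix * 2.0 * math.sqrt(k) * x / (x + k))
        prefix *= (x - k) / (x + k)  # running product of (x - k_j)/(x + k_j)
    return feats

def remainder(x, y, ks):
    """N-term error given by the closed-form product expression."""
    r = 2.0 * x * y / (x + y)
    for k in ks:
        r *= (x - k) * (y - k) / ((x + k) * (y + k))
    return r

x, y = 0.3, 0.07
ks = [0.2, 0.05, 0.5, 0.01]  # arbitrary illustrative parameters
target = 2.0 * x * y / (x + y)
cx, cy = direct_features(x, ks), direct_features(y, ks)
approx = sum(a * b for a, b in zip(cx, cy))
print(target - approx, remainder(x, y, ks))  # identical up to rounding

# With a single repeated parameter the error decays geometrically.
single_k_errors = [abs(remainder(x, y, [0.1] * n)) for n in (1, 2, 4, 8)]
print(single_k_errors)
```

Because the identity is exact, the only discrepancy between the two printed numbers on the first line is floating-point rounding.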

We see the multiple parameters $k_1, k_2, \ldots$ in this series as a boon
rather than a distraction, because they can greatly improve the convergence
rate over the full range [0, 1]. Note that the convergence rate is
dominated by $\left|\frac{x-k}{x+k}\right|$ if there is only one
$k = k_1 = \ldots = k_N$. Although this is fast in general, it can be very
slow if $\left|\frac{x-k}{x+k}\right|$ is close to 1. Two examples are
k = 1, x = 0.005 and k = 0.005, x = 1; in both cases the ratio has
magnitude 0.995/1.005 ≈ 0.99. Evidently, no single choice of k achieves a
good convergence rate on the entire input domain [0, 1]. Our solution is to
use multiple different parameters to cover different regions, and to
combine the parameter choice with the input distribution of our data to
achieve an optimal convergence rate on the entire input domain.

First we establish a simple upper bound of the function to facilitate
simpler error computation. For $0 \le y \le 1$,

$$\frac{2xy}{x+y} \le \frac{2x}{x+1} \tag{8}$$

Now the N-term error can be bounded as:

$$\left|\frac{2xy}{x+y} - c_x^\top c_y\right| \le \left|\frac{2(x-k_1)\cdots(x-k_N)\,x}{(x+k_1)\cdots(x+k_N)(x+1)}\right| \tag{9}$$

Our algorithm for finding the parameters proceeds greedily to eliminate the
highest error peak at each iteration. Specifically, we choose the
parameter:

$$k_{N+1} = \arg\max_x \left|\frac{2(x-k_1)\cdots(x-k_N)\,x}{(x+k_1)\cdots(x+k_N)(x+1)}\right| p(x) \tag{10}$$

where p(x) is the input distribution of x, estimated on each particular
dataset. Such a choice reduces the error to 0 at the mode of the input
distribution and is empirically tested to be superior to other greedy
schemes such as minimizing the mean error at each step. In practice, p(x)
is estimated using a histogram estimate with logarithmically spaced bins,
and $k_{N+1}$ is chosen as one of the bin centers. Such an implementation
is shown in Algorithm 1.

Note that the χ² kernel in this form coincides with the harmonic mean of
the two vectors. Therefore, our approach could also serve as a linear
approximation to the harmonic mean between two vectors. However, we
currently do not know of applications of that.

Algorithm 1 Find the parameters for the input distribution specified by
feature matrix X.
input : Feature matrix $X$, parameter vector length $N$.
output : Parameter vector $k$.
1: Compute a histogram density estimate $h$ on all nonzero values in $X$
   using logarithmically spaced bins in the range
   $[\min_{X \neq 0} X, \max X]$; denote the vector of bin centroids as $x$.
2: $b = \frac{2x}{x+1} \circ h$
3: for $i = 1 \to N$ do
4:    $k_i = x_j$, $j = \arg\max_j |b_j|$
5:    $b = b \circ \frac{x - k_i}{x + k_i}$
6: end for
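The greedy parameter search can be sketched as follows. This is an illustrative reading of Algorithm 1 rather than the authors' code: the number of bins, the use of geometric bin centers, the normalization of the histogram to fractions of samples, and the use of the absolute value when locating the peak are all assumptions made for the example. The input must contain at least two distinct positive values.

```python
import math

def find_parameters(values, n_params, n_bins=50):
    """Greedily pick k_1..k_N from the nonzero entries of a feature matrix."""
    vals = [v for v in values if v > 0]
    lo, hi = math.log(min(vals)), math.log(max(vals))
    width = (hi - lo) / n_bins
    # Histogram estimate h on logarithmically spaced bins.
    counts = [0] * n_bins
    for v in vals:
        idx = min(int((math.log(v) - lo) / width), n_bins - 1)
        counts[idx] += 1
    centers = [math.exp(lo + (i + 0.5) * width) for i in range(n_bins)]
    h = [c / len(vals) for c in counts]
    # b starts as the bound 2x/(x+1) weighted by the estimated density.
    b = [2.0 * x / (x + 1.0) * h_i for x, h_i in zip(centers, h)]
    ks = []
    for _ in range(n_params):
        j = max(range(n_bins), key=lambda i: abs(b[i]))  # highest error peak
        k = centers[j]
        ks.append(k)
        # Multiplying by (x - k)/(x + k) zeroes the weighted error at x = k.
        b = [b_i * (x - k) / (x + k) for b_i, x in zip(b, centers)]
    return ks

# Synthetic histogram entries spread over several orders of magnitude.
values = [0.001 * 1.1 ** i for i in range(70)]
ks = find_parameters(values, 5)
print(ks)
```

Each selected bin center drives the weighted error to zero at that location, so subsequent iterations pick parameters in other regions of the input range.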



5 Principal Component Analysis of Random Features on Multiple Descriptors

Another, rather orthogonal, strategy we pursue is to perform principal
component analysis after obtaining random features, and to solve regression
problems after the PCA. Care needs to be exercised when PCA is performed on
an extremely large-scale dataset in conjunction with multiple kernels.
Although similar approaches have been discussed extensively in the
high-performance computing literature (e.g., [31]), we have not found such
treatments detailed in the vision community, especially in the context of
linear approximations for kernel methods.

The main advantage of using PCA after RF (hereafter called RF-PCA) is to
reduce the memory footprint. It is known that the performance of RF
improves when more random dimensions are used. However, the speed of
learning algorithms usually deteriorates quickly when the data cannot be
loaded into memory, which would be the case when the RF features of
multiple kernels are concatenated: e.g. with 7 kernels and 7,000 RF
dimensions for each kernel, the learning phase following RF needs to
operate on a 49,000-dimensional feature vector.

Using eigenvectors is also one of the very few approaches that could
provide a better asymptotic convergence rate than the O(1/√d) of Monte
Carlo, thus requiring fewer dimensions for an approximation of the same
quality. Many other techniques, like quasi-Monte Carlo, suffer from the
curse of dimensionality: the convergence rate decreases exponentially with
the number of input dimensions [32], which generally makes them unsuitable
for RF, which is supposed to work on high-dimensional problems.

Another interesting aspect of RF-PCA is that it can bring an unexpected
flavor of semi-supervised learning, in that one can use unlabeled test data
to improve classification accuracy. RF-PCA amounts to selecting the
relevant dimensions in the frequency domain. By considering both the
training and testing data during PCA, frequencies that help discriminate
test data will more likely be selected. In the experiments such a strategy
will be shown to improve performance over the computation of PCA only on
training data.

One main problem is that, for a large training set, the feature matrix
cannot be fully loaded into memory. Therefore PCA needs to be performed
out-of-core, a high-performance computing term for this situation (being
unable to load the data into memory). The way to do PCA in linear time is
not by singular value decomposition on the RF features Z, but rather by
performing an eigenvalue decomposition of the centered covariance matrix
$Z^\top (I - \frac{1}{n}\mathbf{1}\mathbf{1}^\top) Z$.
$Z^\top Z = \sum_i Z_{(i)}^\top Z_{(i)}$ can be computed out-of-core by
loading one chunk $X_{(i)}$ into memory at a time, computing its RF feature
$Z_{(i)}$, updating the covariance matrix and then deleting the RF features
from memory. Then an eigen-decomposition gives the transformation matrix U
for PCA. We denote by $\bar{U}$ the matrix obtained by selecting the first
D dimensions of U corresponding to the largest eigenvalues. Denote the mean
vector of the input matrix by $\bar{Z} = \frac{1}{n} Z^\top \mathbf{1}$;
then

$$\tilde{Z} = (Z - \mathbf{1}\bar{Z}^\top)\,\bar{U} = \left(I - \frac{1}{n}\mathbf{1}\mathbf{1}^\top\right) Z \bar{U} \tag{11}$$

is the feature vector obtained after PCA projection (Algorithm 2).

Algorithm 2 Out-of-Core Principal Component Analysis.
input : $n \times d$ data matrix $X = [X_1^\top, X_2^\top, \ldots, X_n^\top]^\top$.
Output vector $y$. Number of dimensions $D$ to retain after PCA.
1: Divide the data into $k$ chunks, called $X_{(1)}, X_{(2)}, \ldots, X_{(k)}$.
2: $H = 0$, $m = 0$, $v = 0$
3: for $i = 1 \to k$ do
4:    Load the i-th chunk $X_{(i)}$ into memory.
5:    Use Algorithm ?? to compute the RF feature $Z_{(i)}$ for $X_{(i)}$.
6:    $H = H + Z_{(i)}^\top Z_{(i)}$, $m = m + Z_{(i)}^\top \mathbf{1}$, $v = v + Z_{(i)}^\top y_{(i)}$
7: end for
8: $H = H - \frac{1}{n} m m^\top$.
9: Compute the eigen-decomposition $H = U D U^\top$. Output the first $D$
   columns of $U$ as $\bar{U}$, the diagonal matrix $D$, and the
   input-output product $v$.

It is very convenient to perform regression with a quadratic loss after
PCA, since only the Hessian is needed for optimization. This applies not
only to traditional least squares regression, but also to the LASSO, group
LASSO, and other composite regularization approaches. In this case the
projections need not be performed explicitly. Instead, notice that only
$\tilde{Z}^\top \tilde{Z}$ and $\tilde{Z}^\top y$ are needed for
regression:

$$\begin{aligned}
\tilde{Z}^\top \tilde{Z} &= \bar{U}^\top Z^\top \left(I - \frac{1}{n}\mathbf{1}\mathbf{1}^\top\right) Z \bar{U} \\
\tilde{Z}^\top y &= \bar{U}^\top Z^\top \left(I - \frac{1}{n}\mathbf{1}\mathbf{1}^\top\right) y
\end{aligned} \tag{12}$$

It follows that only $Z^\top Z$, $Z^\top \mathbf{1}$ and $Z^\top y$ have to
be computed. All terms can be computed out-of-core simultaneously.
Algorithm 3 depicts this scenario.



Under this PCA approach the data is loaded only once to compute the
Hessian. An additional complexity of O(D³) is necessary for the matrix
decomposition of H. If ridge regression is used, H' after decomposition is
diagonal, therefore only O(D) is needed to obtain the regression results.
In this case the additional constant factor is quite small. The bottleneck
of this algorithm for large-scale problems is undoubtedly the computation
of the initial Hessian, which involves reading multiple chunks from disk.

The more sophisticated case is when PCA needs to be performed separately on
multiple different kernel approximators, i.e.,
$Z = [Z^{(1)}\ Z^{(2)}\ \ldots\ Z^{(l)}]$, where each $Z^{(i)}$ is the RF
feature embedding of one kernel. This time, the need to compute
$Z^{(i)\top} Z^{(j)}$ rules out tricks for simple computation. The data
needs to be read in twice (Algorithm 4), first to perform the PCA, and then
to use $\bar{U}$ to transform X in chunks in order to obtain $\tilde{Z}$
and $\tilde{Z}^\top \tilde{Z}$. But the full computation is still linear in
the number of training examples. In both cases, the projection is not
required for the testing examples, because once w is obtained,
$\tilde{Z} w = (Z - \mathbf{1}\bar{Z}^\top)\,\bar{U} w$, so $\bar{U} w$ can
be the weight vector for the original input, with the addition of a
constant term.

Algorithm 3 Learning after PCA with Quadratic Loss.
input : $n \times d$ data matrix $X = [X_1^\top, X_2^\top, \ldots, X_n^\top]^\top$.
Output vector $y$. Number of dimensions $D$ to retain after PCA.
1: Perform out-of-core PCA using Algorithm 2.
2: $H' = \bar{U}^\top H \bar{U} = \bar{D}$, the first $D$ rows and columns
   of the diagonal matrix $D$.
3: $v' = \bar{U}^\top v - \frac{1}{n}(\mathbf{1}^\top y)\,\bar{U}^\top m$.
4: Perform learning on $H', v'$, e.g., for linear ridge regression where
   the optimization is
   $\arg\min_w \|\tilde{Z} w - y\|^2 + \lambda \|w\|^2$, the solution is
   $w = (H' + \lambda I)^{-1} v'$.
5: Use $\bar{U} w$ instead of $w$ as a function of the original inputs:
   $f(x) = w^\top \bar{U}^\top x - \frac{1}{n} w^\top \bar{U}^\top m$, in
   order to avoid the projection for the testing examples.
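The one-pass accumulation behind the out-of-core procedure can be illustrated without any linear algebra library. The sketch below accumulates Z^T Z, Z^T 1 and Z^T y chunk by chunk and then applies the centering corrections H - (1/n) m m^T and v - (1/n)(1^T y) m. It deliberately stops before the eigen-decomposition and the projection by the retained eigenvectors, which would require a numerical library; the chunk contents and function names are made up for the example.

```python
def accumulate(chunks):
    """One pass over (Z_chunk, y_chunk) pairs; returns H, m, v and n."""
    H = m = v = None
    n = 0
    for Z, y in chunks:  # Z is a list of feature rows, y a list of targets
        d = len(Z[0])
        if H is None:
            H = [[0.0] * d for _ in range(d)]
            m = [0.0] * d
            v = [0.0] * d
        for row, target in zip(Z, y):
            n += 1
            for a in range(d):
                m[a] += row[a]                  # m = Z^T 1
                v[a] += row[a] * target         # v = Z^T y
                for b in range(d):
                    H[a][b] += row[a] * row[b]  # H = Z^T Z
    return H, m, v, n

def center(H, m, v, y_sum, n):
    """Return H - m m^T / n and v - (1^T y) m / n."""
    d = len(m)
    Hc = [[H[a][b] - m[a] * m[b] / n for b in range(d)] for a in range(d)]
    vc = [v[a] - y_sum * m[a] / n for a in range(d)]
    return Hc, vc

# Two small chunks standing in for data that does not fit in memory.
Z1, y1 = [[1.0, 2.0], [3.0, 0.0]], [1.0, 0.0]
Z2, y2 = [[2.0, 2.0], [0.0, 4.0]], [0.0, 1.0]
H, m, v, n = accumulate([(Z1, y1), (Z2, y2)])
Hc, vc = center(H, m, v, sum(y1) + sum(y2), n)
print(Hc, vc)
```

The centered quantities match what one would obtain by subtracting the column means from the full feature matrix first, but only one chunk is ever held in memory.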
Algorithm 4 Two-stage Principal Component Analysis when learning with
multiple kernels.
input : $n \times d$ data matrix $X = [X_1^\top, X_2^\top, \ldots, X_n^\top]^\top$.
Output vector $y$. Number of dimensions $D$ to retain after PCA.
1: Perform out-of-core PCA using Algorithm 2.
2: for $i = 1 \to k$ do
3:    Load the i-th chunk $X_{(i)}$ into memory.
4:    Use Algorithm 1 to compute the RF feature $Z_{(i)}$ for $X_{(i)}$,
      with the same randomization vectors $\omega$ as before.
5:    $\tilde{Z}_{(i)} = (Z_{(i)} - \frac{1}{n}\mathbf{1} m^\top)\,\bar{U}$.
6:    $H' = H' + \tilde{Z}_{(i)}^\top \tilde{Z}_{(i)}$, $v' = v' + \tilde{Z}_{(i)}^\top y_{(i)}$
7: end for
8: Perform learning on $H', v'$. E.g., for linear least squares where the
   optimization is $\arg\min_w \|\tilde{Z} w - y\|^2$, the solution is
   $w = H'^{-1} v'$.
9: Use $\bar{U} w$ instead of $w$ as a function of the original inputs:
   $f(x) = w^\top \bar{U}^\top x - \frac{1}{n} w^\top \bar{U}^\top m$, in
   order to avoid the projection step for the testing examples.



It is worth noting that out-of-core least squares or ridge regression
scales extremely well with the number of output dimensions c, which can be
used to solve one-against-all classification problems with c classes. In
the out-of-core case, $Z^\top y$ will be computed in O(nDc) time along with
the Hessian in Algorithm 2 or 4. After the inverse of the Hessian is
obtained, only a matrix-vector multiplication costing O(D²c) is needed to
obtain all the solutions, without any dependency on n. Thus the total time
of this approach with c classes is O(nDc + D²c), which scales very nicely
with c, especially compared with other algorithms that need to perform the
full training procedure on each class. Although the L2 loss is not optimal
for classification, in large-scale problems (e.g. ImageNet) with 1,000 to
10,000 classes, the out-of-core ridge regression can still be used to
generate a fairly good baseline result quickly.
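The reason the cost is independent of n once the Hessian is available can be seen in the ridge case after PCA, where the Hessian is diagonal. The sketch below, with hypothetical names and toy numbers, solves (diag(e) + λI) w_c = v'_c for every class by reusing the same inverted diagonal.

```python
def ridge_all_classes(eigvals, V, lam):
    """Solve (diag(eigvals) + lam * I) w_c = v_c for every class c.

    eigvals: the D retained eigenvalues (the diagonal of H').
    V: list of c right-hand sides v'_c, each of length D.
    """
    inv = [1.0 / (e + lam) for e in eigvals]  # computed once, shared by all
    return [[s * r for s, r in zip(inv, v_c)] for v_c in V]

eigvals = [4.0, 1.0]
V = [[5.0, 2.0], [10.0, -4.0]]  # two classes, D = 2
W = ridge_all_classes(eigvals, V, lam=1.0)
print(W)
```

In this diagonal case each additional class costs only O(D) after the shared accumulation pass over the data.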

6 Experiments 

Our experiments are conducted on two challenging datasets: PASCAL VOC 2010
[33] and ImageNet [28] ILSVRC 2010
(http://www.image-net.org/challenges/LSVRC/2010/). These benchmarks reveal
performance differences among approximation methods that would otherwise be
difficult to observe on simple datasets. We conduct most experiments on the
medium-scale PASCAL VOC data in order to compare against exact kernel
methods. For this dataset, we use exclusively the train and val datasets,
which have 964 images and around 2100 objects each. Classification results
are also shown on the ImageNet dataset to demonstrate the efficiency of our
kernel approximations. The experiments are conducted using an Intel Xeon
E5520 2.27GHz with 8 cores and 24GB memory. The algorithm ?? is
parallelized using OpenMP to take advantage of all cores.

[Figure 1 plot. Panel title: "Approximation Error on 1700-dim HOG from VOC
2010 train"; x-axis: Number of Terms.]

Fig. 1: Comparisons of various approximations to the χ² kernel. It can be
seen that the new direct approximation converges orders of magnitude faster
than previous approaches.

6.1 Comparing Approximations 

To test the different approximations, we consider a medium-scale problem
from the PASCAL VOC segmentation dataset. For training, we use image
segments (obtained using the constrained parametric min-cuts algorithm,
CPMC [34]) that best match each ground truth segment in terms of overlap
(called best-matching segments) in the train set, plus the ground truth
segments. The best-matching segments in the val set are used as test. This
creates a medium-scale problem with 5100 training and 964 test segments.

The methods tested in the experiments are Chebyshev, VZ [8], and Direct.
For reference, we also report classification results for the χ² kernel
without exponentiating as Chi2, as well as the skewed χ² kernel proposed in
[18] as Chi2-Skewed. Due to the Monte Carlo approximation, different random
seeds can lead to quite significant performance variations. Therefore the
experiments are all averaged over 50 trials with different random seeds.
Within each trial, the same random seeds are used for all methods. For
PCA-Chebyshev, the initial sampling is done using three times the final
approximating dimensions, and PCA is performed to reduce the dimensionality
to the same level as the other two methods. We test the classification
performance of these kernels with two different types of features: a bag of
SIFT words (BOW) feature of 300 dimensions, and a histogram of gradient
(HOG) feature of 1700 dimensions. The classification is done via a linear
SVM using the LIBSVM library (empirically we found the LIBLINEAR library
produced worse results than LIBSVM in this context with dense features).
The C parameter in LIBSVM is validated to 50, and the kernel to be
approximated is exp-χ² with β = 1.5. For VZ, the period parameter is set to
the optimal one specified in [8]. For each kernel, 5 dimensions are used to
approximate the χ² distance in each dimension, which represents a common
use case.
Number of Dimensions   3000              5000              7000
Chi2                   41.91%            42.32%            42.12%
Chi2-Skewed            39.82% ± 0.73%    40.79% ± 0.55%    40.90% ± 0.82%
Chebyshev              [illegible]       [illegible]       [illegible]
VZ                     42.29% ± 0.74%    42.82% ± 0.63%    43.00% ± 0.57%
Direct                 42.09% ± 0.79%    42.88% ± 0.63%    43.21% ± 0.63%

PCA-Chebyshev, PCA-VZ, PCA-Direct: only two of the three rows are legible,
and their row labels cannot be matched with certainty:
                       42.80% ± 0.74%    43.25% ± 0.55%    43.42% ± 0.42%
                       43.16% ± 0.55%    43.31% ± 0.53%    43.53% ± 0.71%

Exact exp-χ²           44.19%

TABLE 1: Classification accuracy of the exp-χ² kernel when the χ² function is estimated with different
approximations, on a BOW-SIFT descriptor. Results for the Chi2 and Chi2-Skewed kernels are also shown for
reference.



Number of Dimensions   3000              5000              7000
Chi2                   29.15%            30.50%            31.22%
Chi2-Skewed            30.08% ± 0.74%    30.37% ± 0.63%    30.51% ± 0.35%
Chebyshev              30.86% ± 0.78%    31.53% ± 0.66%    31.90% ± 0.70%
VZ                     31.32% ± 0.90%    32.07% ± 0.83%    32.36% ± 0.62%
Direct                 31.71% ± 0.92%    32.72% ± 0.73%    32.94% ± 0.66%
PCA-Chebyshev          32.59% ± 0.77%    33.11% ± 0.57%    33.22% ± 0.54%
PCA-VZ                 32.94% ± 0.67%    33.41% ± 0.54%    33.45% ± 0.59%
PCA-Direct             32.92% ± 0.66%    33.68% ± 0.57%    33.63% ± 0.67%
Exact exp-χ²           34.34%

TABLE 2: Classification accuracy of the exp-χ² kernel when the χ² function is approximated with different
approximations, on a HOG descriptor. Results for the Chi2 and Chi2-Skewed kernels are also shown for
reference.



[Figure 2 plots. Panel titles: "Accuracy with Number of Terms on BOW
feature" and a second panel whose title is cut off in the source; x-axis:
Number of Terms.]

Fig. 2: Effect on the classification accuracy of a 7000-dimensional
RF-approximated exp-χ² kernel when using different approximations and
various numbers of dimensions to approximate the χ² function.



6.2 Results for Multiple Kernels on the PASCAL 
VOC Segmentation Challenge 

In this section, we consider the semantic segmentation task from PASCAL
VOC, where we need to both recognize objects in images and generate
pixel-wise segmentations for these objects. Ground truth segments of
objects paired with their category labels are available for training.

A recent state-of-the-art approach trains a scoring function for each class
on many putative figure-ground segmentation hypotheses, obtained using CPMC
[?], [35]. This creates a large-scale learning task even if the original
image database has moderate size: with 100 segments in each image, training
for 964 images creates a learning problem with around 100,000 training
examples. This training set is still tractable for exact kernel approaches
and we can directly compare against them.

Two experiments are conducted using multiple kernel approximations for the
exp-χ² kernels. We use 7 different image descriptors, which include 3 HOGs
at different scales, BOW on SIFT for the foreground and background, and BOW
on color SIFT for the foreground and background [34], [36]. The VOC
segmentation measure is used to compare the different approaches. This
measure is the average of pixel-wise average precision on the 20 classes
plus background. To avoid distraction and for a fair comparison, the
post-processing step [34] is not performed and the result is obtained by
only reporting the one segment with the highest score in each image. The
method used for nonlinear estimation is one-against-all support vector
regression (SVR) as in [36], and the method for linear estimation is
one-against-all ridge regression. The latter is used since fast solutions
for linear SVR problems are not yet available for out-of-core dense
features. We avoided stochastic gradient methods (e.g., [26]) since these
are difficult to tune to convergence, and such effects can potentially bias
the results. We average over 5 trials with different random seeds.

Method                   Performance
Chebyshev                26.25% ± 0.41%
VZ                       25.50% ± 0.54%
PCA-Chebyshev            27.57% ± 0.44%
PCA-training-Chebyshev   26.95% ± 0.35%
Nyström                  27.55% ± 0.49%
Kernel SVR               30.47%

TABLE 3: VOC Segmentation Performance on the val set, measured by pixel AP
with one segment output per image (no post-processing), and averaged over 5
random trials. The upper part of the table shows results on only BOW-SIFT
features extracted on foreground and background. The lower part shows
results based on combining 7 different descriptors.

6.3 Results on ImageNet

The ImageNet ILSVRC 2010 is a challenging classification dataset where 1
million images have to be separated into 1,000 different categories. Here
we only show experiments performed using the original BOW feature provided
by the authors. Our goal is primarily to compare among different
approximations, hence we did not generate multiple image descriptors or a
spatial pyramid, which are compatible with our framework and could improve
the results significantly. Since regression is used, the resulting scores
are not well-calibrated across categories. Therefore we perform a
calibration of the output scores to make the 500th highest score of each
class the same.
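One simple way to realize such a calibration is an additive shift per class. The text does not say whether a shift or a rescaling was used, so the sketch below is only one possible reading; the function name, the dictionary layout and the toy scores are assumptions, and each class must have at least `rank` scores.

```python
def calibrate(scores, rank=500):
    """Shift each class's scores so that its rank-th highest score is 0.

    scores: dict mapping a class name to its list of raw regression scores.
    """
    calibrated = {}
    for cls, vals in scores.items():
        pivot = sorted(vals, reverse=True)[rank - 1]
        calibrated[cls] = [s - pivot for s in vals]
    return calibrated

# Toy example with rank=2 instead of 500 to keep the lists short.
raw = {"cat": [0.9, 0.5, 0.1], "dog": [3.0, 2.0, 1.0]}
result = calibrate(raw, rank=2)
print(result)
```

After the shift, the second-highest score of every class in the toy example equals zero, so scores from different classes can be compared on a common scale.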



7 Conclusion 

References 

[1] Z. Zhang, J. T. Kwok, and D.-Y. Yeung, "Model-based
transductive learning of the kernel matrix," Machine Learning,
vol. 63, pp. 69-101, 2006.

[2] A. Vedaldi, V. Gulshan, M. Varma, and A. Zisserman,
"Multiple kernels for object detection," in International
Conference on Computer Vision, 2009.

[3] M. Marszalek, I. Laptev, and C. Schmid, "Actions in context,"
in IEEE Conference on Computer Vision and Pattern Recognition,
2009, pp. 2929-2936.

[4] J. Gonfaus, X. Boix, J. V. de Weijer, A. Bagdanov, J. Serrat,
and J. Gonzàlez, "Harmony potentials for joint classification
and segmentation," in IEEE Conference on Computer Vision and
Pattern Recognition, 2010.

[5] A. Rahimi and B. Recht, "Random features for large-scale
kernel machines," in Advances in Neural Information Processing
Systems, 2007.

[6] V. Sreekanth, A. Vedaldi, C. V. Jawahar, and A. Zisserman,
"Generalized RBF feature maps for efficient detection," in
Proceedings of the British Machine Vision Conference, 2010.

[7] C. K. I. Williams and M. Seeger, "Using the Nyström method to
speed up kernel machines," in Advances in Neural Information
Processing Systems, 2001.

[8] A. Vedaldi and A. Zisserman, "Efficient additive kernels via
explicit feature maps," IEEE Transactions on Pattern Analysis and
Machine Intelligence, vol. 34, 2012.

[9] F. Li, G. Lebanon, and C. Sminchisescu, "Chebyshev
approximations to the histogram chi-square kernel," in IEEE
Conference on Computer Vision and Pattern Recognition, 2012.

[10] P. L. Bartlett, O. Bousquet, and S. Mendelson, "Local
Rademacher complexities," Annals of Statistics, vol. 33, pp.
1497-1537, 2005.

[11] B. Schiele and J. L. Crowley, "Object recognition using
multidimensional receptive field histograms," in European
Conference on Computer Vision, 1996.

[12] O. Chapelle, P. Haffner, and V. Vapnik, "Support vector
machines for histogram-based image classification," IEEE
Transactions on Neural Networks, vol. 10, pp. 1055-1064, 1999.

[13] C. Fowlkes, S. Belongie, F. Chung, and J. Malik, "Spectral
grouping using the Nyström method," IEEE Transactions on
Pattern Analysis and Machine Intelligence, vol. 26, pp. 214-225,
2004.

[14] A. Bosch, A. Zisserman, and X. Munoz, "Representing shape
with a spatial pyramid kernel," in CIVR'07, 2007.

[15] O. Pele and M. Werman, "The quadratic-chi histogram
distance family," in European Conference on Computer Vision, 2010.

[16] S. Maji, A. Berg, and J. Malik, "Efficient classification for
additive kernel SVMs," IEEE Transactions on Pattern Analysis and
Machine Intelligence, vol. 35, pp. 66-77, 2013.

[17] Y. Rubner, C. Tomasi, and L. Guibas, "A metric for
distributions with applications to image databases," in
International Conference on Computer Vision, 1998.

[18] F. Li, C. Ionescu, and C. Sminchisescu, "Random Fourier
approximations for skewed multiplicative histogram kernels,"
in DAGM, 2010.

[19] P. Drineas and M. Mahoney, "On the Nyström method for
approximating a Gram matrix for improved kernel-based
learning," Journal of Machine Learning Research, vol. 6,
pp. 2153-2175, 2005.

[20] T. Yang, Y.-F. Li, M. Mahdavi, R. Jin, and Z.-H. Zhou,
"Nyström method vs random Fourier features: A theoretical and
empirical comparison," in Advances in Neural Information
Processing Systems, 2012.

[21] F. Bach and M. I. Jordan, "Predictive low-rank decomposition
for kernel methods," in Proceedings of the International
Conference on Machine Learning, 2005.

[22] S. Fine and K. Scheinberg, "Efficient SVM training using
low-rank kernel representation," Journal of Machine Learning
Research, vol. 2, pp. 243-264, 2001.

[23] H. Lee, A. Battle, R. Raina, and A. Y. Ng, "Efficient sparse
coding algorithms," in Advances in Neural Information
Processing Systems, 2007, pp. 801-808.

[24] J. Wang, J. Yang, K. Yu, F. Lv, T. Huang, and Y. Gong,
"Locality-constrained linear coding for image classification,"
in IEEE Conference on Computer Vision and Pattern Recognition,
2010.

[25] H. Lee, R. Grosse, R. Ranganath, and A. Y. Ng, "Convolutional
deep belief networks for scalable unsupervised learning of
hierarchical representations," in Proceedings of the International
Conference on Machine Learning, 2009, pp. 609-616.

[26] Y. Lin, F. Lv, S. Zhu, M. Yang, T. Cour, K. Yu, L. Cao, and
T. S. Huang, "Large-scale image classification: Fast feature
extraction and SVM training," in IEEE Conference on Computer
Vision and Pattern Recognition, 2011.

[27] F. Perronnin, J. Sanchez, and T. Mensink, "Improving the
Fisher kernel for large-scale image classification," in European
Conference on Computer Vision, 2010.



IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. XX, NO. X, JANUARY 2013 






Number of Dimensions    3000              5000              7000

Chebyshev               16.30% ± 0.04%    17.11% ± 0.04%    17.63% ± 0.09%
PCA-Chebyshev           16.66% ± 0.08%    17.85% ± 0.08%    18.65% ± 0.10%
VZ                      16.10% ± 0.04%    16.95% ± 0.08%    17.48% ± 0.09%
Direct
Nystrom

Linear

TABLE 4: Performance of linear classifiers as well as non-linear approximation methods on ImageNet ILSVRC
2010 data. Notice the significant boost provided by the non-linear approximations (exact non-linear calculations
are intractable in this large-scale setting).